Current audio-visual separation methods share a standard architecture in which an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We use "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that iQuery improves audio-visual sound source separation performance.
Translated by Google Translate
One important topic in research on Chinese classical poetry is the analysis of poetic style. Researchers examining the works of previous dynasties mostly judge poetic style by subjective impression and by reference to earlier evaluations that have become settled conclusions. Although this approach is often effective, it can lead to errors. This paper builds what is currently the most complete dataset of Chinese classical poetry, trains a BART-poem pre-trained model on it, and proposes a generally applicable method for judging poetic style based on this model, introducing deep learning into computational stylistics and providing a new research method for the study of classical poetry. The paper applies this method to the problem of identifying Tang and Song poetic styles, taking as research objects poetry schools considered to have relatively clear and consistent styles, such as the Hongzheng Qizi and Jiajing Qizi, the Jiangxi poetic school, and the Tongguang poetic school, and testing on poems by their representative poets. Experiments show that the model's judgments of the tested poems are largely consistent with the conclusions of critics of previous dynasties, verify some avant-garde judgments of Mr. Qian Zhongshu, and better solve the task of recognizing Tang and Song poetic styles.
The role of mobile cameras has increased dramatically over the past few years, leading to more and more research on automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline that replaces the standard mobile ISPs and can run on modern smartphone GPUs using TensorFlow Lite. Participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format Fujifilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon 8 Gen 1 GPU, which provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs and can process Full HD photos in less than 20-50 milliseconds while achieving high-fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
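In neuroimaging analysis, functional magnetic resonance imaging (fMRI) is well suited to assessing changes in brain function for brain diseases without obvious structural lesions. To date, most fMRI studies take functional connectivity as the basic feature for disease classification. However, functional connectivity is usually computed from the time series of predefined regions of interest and ignores the detailed information contained in each voxel, which may degrade the performance of diagnostic models. Another methodological drawback is the limited sample size available for training deep models. In this study, we propose Brainformer, a general hybrid Transformer architecture for brain disease classification from a single fMRI volume, designed to fully exploit voxel-wise details with sufficient data size and dimensions. Brainformer models local cues within each voxel using 3D convolutions and captures global relations among distant regions using two global attention blocks. The local and global cues are aggregated in Brainformer by a single-stream model. To handle multi-site data, we propose a normalization layer that normalizes the data to the same distribution. Finally, a gradient-based localization-map visualization method is used to locate possible disease-related biomarkers. We evaluate Brainformer on five independently acquired datasets, ABIDE, ADNI, MPILMBB, ADHD-200, and ECHO, covering autism, Alzheimer's disease, depression, attention deficit hyperactivity disorder, and headache disorders. The results demonstrate the effectiveness and generalizability of Brainformer for diagnosing multiple brain diseases. Brainformer may promote neuroimaging-based precision diagnosis in clinical practice and motivate future research in fMRI analysis. Code is available at: https://github.com/ziyaozhangforpcl/brainformer.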
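A fundamental design decision in multilingual neural text-to-speech (NTTS) systems is how to represent input linguistic features within the model. Looking at the wide variety of approaches in the literature, two main paradigms emerge: unified and separate representations. The former uses a shared set of phonetic tokens across languages, whereas the latter uses unique phonetic tokens for each language. In this paper, we conduct a comprehensive study comparing multilingual NTTS models trained with the two representations. Our results show that the unified approach consistently achieves better cross-lingual synthesis in terms of both naturalness and accent. Separate representations tend to have a much larger number of tokens than unified ones, which may affect model capacity. We therefore carry out an ablation study to understand the interaction between representation type and token embedding size. We find that the difference between the two paradigms only emerges above a certain threshold embedding size. This study provides strong evidence that unified representations should be the preferred paradigm when building multilingual NTTS systems.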
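The difference in token-set size between the two paradigms can be illustrated with a toy sketch. The phoneme inventories, the language tags, and the function names `unified_vocab` and `separate_vocab` below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def unified_vocab(phoneme_sets):
    """Unified representation: one token per phoneme, shared across languages."""
    return sorted(set().union(*phoneme_sets.values()))


def separate_vocab(phoneme_sets):
    """Separate representation: every phoneme is tagged with its language,
    so the same sound in two languages becomes two distinct tokens."""
    return sorted(
        f"{lang}:{phoneme}"
        for lang, phonemes in phoneme_sets.items()
        for phoneme in phonemes
    )


# Toy inventories (hypothetical, far smaller than any real phoneme set).
phoneme_sets = {
    "es": {"a", "e", "r"},
    "it": {"a", "e", "ts"},
}

unified = unified_vocab(phoneme_sets)
separate = separate_vocab(phoneme_sets)
print(len(unified), len(separate))
```

In this toy case the separate token set grows with every added language even when the sounds overlap, which is the kind of size difference that motivates varying the embedding size in an ablation.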
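Training multilingual neural text-to-speech (NTTS) models using only monolingual corpora has emerged as a popular way of building voice-cloning-based polyglot NTTS systems. To train these models, it is essential to understand how the composition of the training corpus affects the quality of multilingual speech synthesis. In this context, it is common to hear questions such as "Would including more Spanish data help my Italian synthesis, given the closeness of the two languages?" Unfortunately, we found the existing literature on the topic lacking in completeness. In the present work, we conduct an extensive ablation study aimed at understanding how various factors of the training corpus, such as language family affiliation, gender composition, and the number of speakers, contribute to the quality of polyglot synthesis. Our findings include the observation that female speaker data are preferred in most situations, and that having more speakers from the target language in the training corpus is not always beneficial. The findings herein are informative for the data procurement and corpus-building process.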
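Beyond image classification, Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success on a wide range of vision tasks, including object-level and 3D spatial understanding. However, it remains challenging to transfer the semantic knowledge learned from CLIP to more intricate tasks with quantified targets, such as depth estimation with geometric information. In this paper, we propose applying CLIP to zero-shot monocular depth estimation, named DepthCLIP. We find that patches of the input image respond to certain semantic distance tokens and can then be projected to quantized depth bins for coarse estimation. Without any training, our DepthCLIP surpasses existing unsupervised methods and even approaches early fully supervised networks. To the best of our knowledge, we are the first to conduct zero-shot adaptation from semantic language knowledge to quantified downstream tasks and to perform zero-shot monocular depth estimation. We hope our work can shed light on future research. The code is available at https://github.com/adonis-galaxy/depthclip.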
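The projection from semantic-distance responses to depth bins might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes a patch's similarities to a set of distance prompts are turned into softmax weights and that the coarse depth is the weighted sum of per-bin depth values. The bin depths, the temperature, the similarity scores, and the name `coarse_depth` are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_depth(similarities, bin_depths, temperature=0.1):
    """Weight each depth bin by the patch's response to its distance prompt."""
    if len(similarities) != len(bin_depths):
        raise ValueError("need one similarity score per depth bin")
    weights = softmax(similarities, temperature)
    return sum(w * d for w, d in zip(weights, bin_depths))


# Hypothetical depth values (in metres) for three distance prompts,
# e.g. "close", "a little remote", "far".
bin_depths = [1.0, 2.0, 4.0]

# A patch that responds equally to all prompts lands on the mean bin depth.
uniform_depth = coarse_depth([0.2, 0.2, 0.2], bin_depths)

# A patch that responds strongly to the first prompt lands near its bin.
near_depth = coarse_depth([1.0, 0.0, 0.0], bin_depths, temperature=0.01)
print(uniform_depth, near_depth)
```

A lower temperature makes the estimate snap to the best-matching bin, while a higher one blends neighbouring bins; which setting is appropriate is not stated in the abstract.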
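This technical report presents an effective method for motion prediction in autonomous driving. We develop a Transformer-based method for input encoding and trajectory prediction. In addition, we propose a temporal flow head to enhance trajectory encoding. Finally, an efficient K-means ensemble method is used. With our Transformer network and ensemble method, we won first place in the Argoverse 2 Motion Forecasting Challenge with a state-of-the-art Brier-minFDE score of 1.90.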
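For readers unfamiliar with the reported metric, the sketch below shows how a Brier-minFDE score could be computed for a single scenario. It assumes the commonly used definition, which the report itself does not spell out: the smallest final displacement error over the predicted trajectories, plus the penalty (1 - p)^2, where p is the probability assigned to that best trajectory. The function name and the example values are hypothetical.

```python
import math


def brier_min_fde(forecast_endpoints, probabilities, true_endpoint):
    """Minimum endpoint error plus a Brier-style penalty on its probability.

    forecast_endpoints: list of (x, y) final positions, one per forecast.
    probabilities: confidence assigned to each forecast.
    true_endpoint: the ground-truth final (x, y) position.
    """
    if len(forecast_endpoints) != len(probabilities):
        raise ValueError("need one probability per forecast")
    errors = [math.dist(p, true_endpoint) for p in forecast_endpoints]
    best = min(range(len(errors)), key=errors.__getitem__)
    return errors[best] + (1.0 - probabilities[best]) ** 2


# Two hypothetical forecasts: the second ends 1 m from the true endpoint.
score = brier_min_fde(
    forecast_endpoints=[(3.0, 4.0), (0.0, 1.0)],
    probabilities=[0.5, 0.5],
    true_endpoint=(0.0, 0.0),
)
print(score)
```

Under this definition a model is rewarded both for placing one forecast close to the truth and for assigning that forecast a high probability.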
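Contrastive Vision-Language Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts and thus perform zero-shot recognition in open-vocabulary scenarios. However, there is a semantic gap between specific applications and generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP, which enhances vision-language modeling via visual-guided texts. Specifically, we guide the text features to adaptively explore informative regions of the image and aggregate visual features through a cross-attention mechanism. In this way, the visual-guided texts become more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.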
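In this work, we propose a new and general framework to defend against backdoor attacks, inspired by the fact that attack triggers usually follow a specific type of attacking pattern, so that poisoned training examples have greater impacts on each other during training. We introduce the notion of the influence graph, which consists of nodes and edges representing individual training points and the associated pair-wise influences, respectively. The influence between a pair of training points represents the impact of removing one training point on the other, approximated by the influence function \citep{koh2017understanding}. Malicious training points are extracted by finding the maximum average sub-graph of a particular size. Extensive experiments on computer vision and natural language processing tasks demonstrate the effectiveness and generality of the proposed framework.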
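The extraction step can be illustrated with a simple greedy heuristic. This sketch is not the authors' algorithm, which the abstract does not describe: it assumes a symmetric matrix of pair-wise influence scores is already available and repeatedly drops the point with the weakest total influence on the remaining points until a target size is reached. The function name and the toy scores are hypothetical.

```python
def greedy_max_average_subgraph(influence, size):
    """Greedy peeling: keep `size` nodes with strong mutual influence.

    influence: square, symmetric matrix of pair-wise influence scores.
    size: number of nodes to keep in the extracted sub-graph.
    """
    n = len(influence)
    if not 0 < size <= n:
        raise ValueError("size must be between 1 and the number of nodes")
    kept = set(range(n))
    while len(kept) > size:
        # Remove the node whose influence on the other kept nodes is smallest.
        weakest = min(
            sorted(kept),
            key=lambda i: sum(influence[i][j] for j in kept if j != i),
        )
        kept.remove(weakest)
    return sorted(kept)


# Toy example: points 0-2 influence each other strongly (as poisoned
# examples sharing a trigger might), while points 3-4 do not.
poisoned = {0, 1, 2}
influence = [
    [
        0.0 if i == j else (0.9 if i in poisoned and j in poisoned else 0.1)
        for j in range(5)
    ]
    for i in range(5)
]

suspects = greedy_max_average_subgraph(influence, size=3)
print(suspects)
```

Greedy peeling is only a heuristic for this kind of dense-subgraph problem and carries no optimality guarantee for a fixed size; it is used here purely to make the idea of a high-average-influence cluster concrete.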